Missing data arise in genetic association studies when genotypes are unknown or when haplotypes are of direct interest. We provide a general likelihood-based framework for making inference on genetic effects and gene–environment interactions with such missing data. We allow genetic and environmental variables to be correlated while leaving the distribution of environmental variables completely unspecified. We consider 3 major study designs (cross-sectional, case–control, and cohort designs) and construct appropriate likelihood functions for all common phenotypes (e.g., case–control status, quantitative traits, and potentially censored ages at onset of disease). The likelihood functions involve both finite- and infinite-dimensional parameters. The maximum likelihood estimators are shown to be consistent, asymptotically normal, and asymptotically efficient. Expectation–Maximization (EM) algorithms are developed to implement the corresponding inference procedures. Extensive simulation studies demonstrate that the proposed inferential and numerical methods perform well in practical settings. Illustration with a genome-wide association study of lung cancer is provided.